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^ ' With the aim to study the relationship between protein sequences and their 

S ■ native structures, we adopt vectorial representations for both sequence and struc- 

3 ' ture. The structural representation is based on the Principal Eigenvector of the 

fold's contact matrix (PE). As recently shown, the latter encodes sufficient infor- 
mation for reconstructing the whole contact matrix. The sequence is represented 
through a Hydrophobicity Profile (HP), using a generalized hydrophobicity scale 
that we obtain from the principal eigenvector of a residue-residue interaction 
matrix and denote it as interactivity scale. Using this novel scale, we define the 
QQ ' optimal HP of a protein fold, and predict, by means of stability arguments, that 

it is strongly correlated with the PE of the fold's contact matrix. This predic- 
tion is confirmed through an evolutionary analysis, which shows that the PE 
C\ ' correlates with the HP of each individual sequence adopting the same fold and, 

even more strongly, with the average HP of this set of sequences. Thus, protein 
sequences evolve in such a way that their average HP is close to the optimal one, 
implying that neutral evolution can be viewed as a kind of motion in sequence 
space around the optimal HP. Our results indicate that the correlation coefficient 
^ ' between A'^- dimensional vectors constitutes a natural metric in the vectorial space 

CO , in which we represent both protein sequences and protein structures, which we 

^^ ■ call Vectorial Protein Space. In this way, we define a unified framework for se- 

quence to sequence, sequence to structure, and structure to structure alignments. 
We show that the interactivity scale is nearly optimal both for the comparison 
of sequences with sequences and sequences with structures. 
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INTRODUCTION 

The information contained in protein sequences 
can be represented by intrinsic profiles, such as 
hydrophobicity [1,2], charge, and secondary struc- 
ture propensities. Structural information can also 
be reduced to one-dimensional profiles describing 
structural and chemical properties of the amino 
acids [3], including secondary structure and sol- 
vent accessibility [4]. It has been shown that the 
Hydrophobicity Profile (HP) is correlated with the 
solvent accessibility of the native structure [5] , in- 
dicating that sequence and structure profiles are 
intimately related [6,7]. 

In order to gain further insight into the 
sequence-structure relationship, we represent both 
protein sequences and protein structures as vec- 
tors in an iV-dimensional space, denoted as Vec- 
torial Protein Space, and study their mutual re- 
lationship. In this work, we adopt a structural 
representation of proteins based on the Principal 
Eigenvector of the native contact matrix (PE). The 
PE has already been used as an indicator of pro- 
tein topology, in particular as a mean of identifying 
structural domains [8] and clusters of amino acids 
with special structural significance [9,10]. We have 
recently shown that knowledge of the PE suffices to 
reconstruct the complete contact matrix of single- 
domain globular proteins [11]. 

Protein sequences are here represented through 
their HP [2]. We introduce a new generalized hy- 
drophobicity scale that is based on the principal 
eigenvector of a residue-residue interaction matrix. 
This scale, which we call interactivity, correlates 
strongly with empirical hydrophobicity scales, even 
if it incorporates also other interactions besides the 
hydrophobic effect. The inclusion of the native 
fold's effective free energy in the new interactiv- 
ity scale is crucial for deducing the relationship 
between PE and HP. 

We define the optimal HP for a given fold and 
show that a strong correlation can be expected be- 
tween this optimal HP and the PE. Thus, the op- 
timal HP carries the "hydrophobic fingerprint" of 
a protein fold. While we do not expect that the 
optimal HP is realized in protein evolution, we do 
expect that protein evolution is confined in a re- 
gion of sequence space centered around the opti- 
mal HP, so that the evolutionary average of the 
HP over sequences sharing the same fold should 
more strongly resemble the optimal HP. 

To test this expectation, we performed an evo- 
lutionary analysis using two kinds of protein sets: 
(i) the PFAM [12] and FSSP [13] databases. 



containing sequences of protein families adopting 
the same fold, and (ii) the SCN database, con- 
taining sequences of seven protein families ob- 
tained through the Structurally Constrained Neu- 
tral (SCN) model of evolution [14-16] that imposes 
the conservation of the thermodynamic stability of 
the native fold. For all three databases we found 
that the HP of individual sequences are positively 
correlated to the PE, yet the evolutionary average 
of the HP is more strongly correlated to it than 
each sequence individually. This finding suggests 
that the HP of a protein sequence encodes a large 
part of the information on the PE of the native con- 
tact matrix and hence on the native structure. The 
correlation between the average HP and the PE 
allows deriving a site-specific pattern of sequence 
conservation, which can be used in evolutionary 
studies. 

Representing protein structures in the same vec- 
tor space as protein sequences permits to define a 
distance measure, based on the correlation coef- 
ficient, which is applicable to sequence-sequence, 
structure-structure, as well as sequence-structure 
comparisons. This definition provides a unified 
framework for all kinds of protein alignments. In 
this context, we compared the similarity score 
based on the Hydrophobicity Correlation (HC), al- 
ready proposed by Sweet and Eiscnberg [2], with 
the widely used BloSum [17] score, using as bench- 
mark a set of distantly related proteins sharing the 
TIM barrel fold [18]. We also derived indepen- 
dently optimal parameters for sequence-structure 
and for sequence-sequence comparisons. In both 
cases, the interactivity parameters are almost iden- 
tical with the optimal ones. 



METHODS 
The contact matrix and its spectral properties 

The contact matrix C is a binary matrix, with 
elements Cy = 1 if amino acids at positions i and 
j are in contact, and otherwise. Only residues 
separated by at least three positions along the se- 
quence are considered in contact, so that C'ij — 
for all i, j with \i — j\ < 3. Two residues are con- 
sidered to be in contact if any two of their heavy 
atoms are closer than 4.5 A in space. Therefore, 
the contact condition depends on the size of the 
amino acids at positions i and j. 

Since the contact matrix is an iV x iV symmetric 
matrix, it has N real eigenvalues A^, a ~ I, . . . , N , 



which we rank in decreasing order. The corre- 
sponding eigenvectors {c'"'} form an orthonornial 
system.^ 



Effective connectivity and principal 

eigenvector 

From the contact matrix C, we define a vecto- 
rial representation c(C), which we call the effective 
connectivity, such that positions i with large Ci(C) 
are in contact with as many as possible positions 
j with large Cj{C). This self-consistent definition 
can be formally expressed through the condition 
that the vector c(C) maximizes the quadratic form 



E^: 



(1) 



with the normalization condition X]i=i "^f = 1- 
The solution of this maximization problem is the 
Principal Eigenvector (PE) of the contact ma- 
trix corresponding to the largest eigenvalue Ai, 
c(C) — c^^\ In the following, we denote the PE 
by c instead of c*^^). 

From this maximization property, it follows that 
the largest positive eigenvalue Ai has a value rang- 
ing between the average number of contacts per 
residue, "^ij Cij /N , and the maximal number of 

contacts of any given residue, max^ I ^ ■ Cij ) [19]. 

In addition, since all elements of C are positive or 
zero, the PE has all components of the same sign 
or zero. We choose by convention the positive sign. 
PE components are zero only for residues that do 
not form contacts with residues with non- vanishing 
PE. Vanishing PE components may indicate a do- 
main decomposition of the protein structure. The 
components of the PE outside the main domain 
are never exactly zero, but their value is much 
smaller than inside the main domain. The algo- 
rithm for automatic domain decomposition pro- 
posed by Holm and Sander [8] is based on a sim- 
ilar idea. For this reason, multi-domain proteins 
are expected to have a larger variance of their PE 
components than single-domain ones. A study of 
the Protein Data Bank (PDB) has confirmed this 
expectation (data not shown). Therefore, we use 
as a signature for single-domain proteins a small 
relative variance of the PE, the latter defined as 



^In the following, we indicate vectors and matrices 
using bold face letters. 



B 



E.(c.-(c))^_l-iV(cy 



N{cy 
where (c) = N^^ J2i ^i 



N{cy 



(2) 



Similarity measure 

Both the PE and the HP are vectors in an 
iV-dimensional space, where N is the number of 
amino acids. The most natural way of defining 
a similarity measure in this space is through the 
normalized scalar product: The cosine of the an- 
gle a between two vectors x and y is defined 
as cos a == Y^^iyil ^/Y.t^lY.iVf- This quan- 
tity, however, is strongly dependent on the aver- 
age value of the two vectors, (x) = N^^^^Xi and 
(j/) ~ N~^ 'YliiVi- It is therefore more convenient 
to use as similarity measure Pearson's correlation 
coefficient r(x,y), which is the normalized scalar 
product of the vectors with components Xi — (x\ 
and Hi — {%y and is defined as 



?'(x,y) 



Y.^ {^^ - {^)) {y^ " (v)) 



E^ {^^ - (x))^ T,^ {Vr " {v) 



(3) 



Effective folding free energy 

We will use in the following a simple model of 
protein thermodynamics. In this model, the free 
energy of a sequence A folded into a contact map C 
is approximated by an effective contact free energy 
function E{A,C), 



E{A,C) 
knT 



2_^Cij U(Ai,Aj 



(4) 



i<j 



where U is a 20 x 20 symmetric matrix with U{a, b) 
representing the effective interaction, in units of 
kBT, of amino acids a and b when they are in con- 
tact. We use the interaction matrix derived by 
BastoUa et al. [20]. 



The SCN model of protein evolution 

The Structurally Constrained Neutral (SCN) 
model [14-16] is based on mutations and purify- 
ing selection. Selection is imposed requiring that 



the folding free energy and the normahzed energy 
gap of the native structure, calculated through the 
effective free energy function Eq. (4) [20] , are above 
predetermined thresholds. The model reproduces 
the main qualitative features of protein sequence 
evolution and allows structurally based sequence 
conservation for specific protein folds to be pre- 
dicted [14-16]. The conservation values estimated 
through the SCN model are in agreement with 
those measured in the corresponding FSSP fam- 
ily, with the exception of functionally constrained 
positions, which are not conserved in our model 
[14]. 



Sequence Databases 

We studied in detail seven protein folds with 
different length N: The TIM barrel {N = 247, 
PDB code 7tim_A), the ubiquitin conjugating en- 
zyme {N — 160, PDB code lu9a_A), myoglobin 
{N = 151, PDB code la6g), lysozymc (N = 129, 
PDB code 31zt), ribonuclease {N = 124, PDB 
code 7rsa), cytochrome c {N ~ 82, PDB code 
451c), rubrcdoxin {N = 53, mcsophilic: PDB code 
liro; thermophilic: PDB code lbrf_A). 

For each fold, we collected three families of 
aligned sequences expected to belong to the same 
structural class: (i) The PFAM family [12], 
grouped through sequence comparison techniques, 
(ii) The FSSP family [13], grouped through struc- 
ture comparison techniques, (iii) The SCN family 
[14-16], obtained by simulating molecular evolu- 
tion with the constraint that the thermodynamic 
stability of the native structure must be conserved. 



We also determined two new sets of 20 normal- 
ized parameters by maximizing the two scores (a) 
and (b) respectively, averaged over a training set of 
proteins. For the optimization we adopted an ad 
hoc method based on the fact that a generic pa- 
rameter set can be written as a linear combination 
of the 20 eigenvectors of the interaction matrix U. 
For parameter sets consisting of a single eigenvec- 
tor, the score is large for the principal eigenvector, 
medium for a few eigenvectors and low for all other 
ones. Therefore, it is reasonable to expect that the 
optimal set will be a combination of a small num- 
ber of eigenvectors. Our method works in three 
steps, (i) For all 19 • 18 • 17/6 = 969 combina- 
tions of four eigenvectors of U (always including 
the principal one), we maximized the average score 
as a function of the three corresponding coefficients 
(one is fixed by the normalization condition), us- 
ing exact enumeration with large steps, (ii) We 
selected the six combinations of four eigenvectors 
giving the largest scores and we optimized the co- 
efficients using smaller steps, (iii) We checked that 
addition of each of the remaining eigenvectors did 
not improve the result significantly. 

In case (a), we used as training set the single- 
domain globular proteins described below. In 
case (b), the similarity score r(h, h') was cal- 
culated for pairs of distantly related sequences 
with the TIM barrel fold. In both cases, the 
optimization was also performed with standard 
Monte Carlo techniques, yielding a very similar re- 
sult. In case (b), we did not optimize the place- 
ment of gaps at the same time, but we used 
gapped alignments obtained through structural 
superposition downloaded from the Dali server 
(http: //www. ebi . ac .uk/dali/f ssp/). 



Optimization of hydrophobicity parameters 

In this work, we use generalized hydrophobicity 
parameters that are obtained from the principal 
eigenvector of the interaction matrix represented 
in Eq. (4) [20,21]. We call this set of parame- 
ters the interactivity parameters, and we use them 
to obtain a vectorial representation of protein se- 
quences. Protein structures are also represented 
as vectors through the Principal Eigenvector (PE) 
of their contact matrix. We will show here that 
the interactivity parameters simultaneously confer 
high similarity score, i.e. high correlation coeffi- 
cient, (a) to pairs of vectors representing a protein 
sequence and its native structure, and (b) to pairs 
of distantly related sequences sharing the same 
structure (see below). 



PDB set 

We computed the correlation between HP and 
PE for a subset of non-redundant single-domain 
globular proteins (404 single-domain structures of 
length A^ < 200). We tested the condition of globu- 
larity by imposing that the fraction of contacts per 
residue was larger than a length dependent thresh- 
old, Nc/N > 3.5 + 7.8,N-'^/^. This hmctional form 
represents the scaling of the number of contacts 
in globular proteins as a function of chain length 
(the factor N^^^^ comes from the surface to vol- 
ume ratio), and the two parameters were chosen 
so to eliminate outliers with respect to the general 
trend, which are mainly non-globular structures. 
The condition of being single-domain was imposed 



by checking that the variance of the PE compo- 
nents, Eq. (2), was B < 1.5. 



RESULTS 

Optimal HP 

We investigate the relationship between protein 
sequence and structure using an effective folding 
free energy function, based on the pairwise in- 
teraction matrix U, see Eq. (4). This interac- 
tion matrix can be written in spectral form as 
U{a,b) = X]a=i ^a"^"''(^)"'"H^)7 where £„ are 
the eigenvalues, ranked by their absolute value, 
and u^"' are the corresponding eigenvectors. The 
contribution of the principal eigenvector u*-^-* to 
the spectral decomposition, ei u(^^(a) ^'-^^(6), with 
ei < 0, constitutes the main component of the in- 
teraction matrix. It has correlation coefficient 0.81 
with the elements U{a, b) of the full matrix. Thus, 
an approximated effective energy function, yield- 
ing a good approximation to the full contact en- 
ergy Eq. (4), can be obtained as 



g(A,C) 



ei Yl ^'^ ^(^») ^*^^j 



(5) 



i<j 



where the set of parameters h{a) = u^^''{a) coin- 
cide with the main eigenvector of the interaction 
matrix. 

It has been shown by Li et al. [22] that the eigen- 
vector of the Miyazawa and Jernigan contact inter- 
action matrix [23] corresponding to the largest (in 
absolute value) eigenvalue is related to hydropho- 
bicity, having a correlation coefRcient of 0.77 with 
an empirical hydrophobicity scale. For the interac- 
tion matrix used in this study [20] , the correspond- 
ing eigenvector, u'-^^ has a correlation coefRcient 
of 0.85 with the Fauchere and Pliska hydrophobic- 
ity scale [24] (cf. Table I). 

We call the main eigenvector of the interaction 
matrix, h{a) = u^^'{a), the interactivity oi the cor- 
responding residues a. These parameters are based 
also on other interactions besides the hydrophobic 
effect, e.g. electrostatic interactions, but the hy- 
drophobicity is their main component, as indicated 
by the strong correlation between the interactivity 
and the Fauchere and Pliska hydrophobicity scale. 
Therefore, we will call the iV-dimensional vector 
h(A) the Hydrophobicity Profile of sequence A, 
abbreviated in the following as HP. 



A 


0.1366 


0.0728 


0.1510 


0.31 


E 


-0.0484 


-0.0295 


-0.0639 


-0.64 


Q 


0.0325 


0.0126 


0.0246 


-0.22 


D 


-0.1233 


-0.0552 


0.0047 


-0.77 


N 


-0.0345 


-0.0390 


0.0381 


-0.60 


L 


0.4251 


0.3819 


0.3926 


1.70 


G 


-0.0464 


-0.0589 


0.0248 


0.00 


K 


-0.0101 


-0.0053 


-0.0158 


-0.99 


S 


-0.0433 


-0.0282 


0.0040 


-0.04 


V 


0.4084 


0.2947 


0.3997 


1.22 


R 


0.0363 


0.0394 


-0.0103 


-1.01 


T 


0.0589 


0.0239 


0.1462 


0.26 


P 


0.0019 


-0.0492 


0.0844 


0.75 


I 


0.4172 


0.3805 


0.4238 


1.80 


M 


0.1747 


0.1613 


0.2160 


1.23 


F 


0.4076 


0.4201 


0.3455 


1.79 


Y 


0.3167 


0.3113 


0.2998 


0.96 


C 


0.2745 


0.3557 


0.3222 


1.54 


W 


0.2362 


0.4114 


0.2657 


2.25 


H 


0.0549 


0.0874 


0.1335 


0.13 



TABLE I. Interactivity scales derived in this work. 
In column 2, the parameters are obtained from the 
components of the principal eigenvector of the con- 
tact interaction matrix [20] (see 'Optimal HP'). In col- 
umn 3, the parameters are obtained maximizing the 
mean r(c, h) over single-domain globular proteins (see 
'Parameter optimization'). In column 4, the param- 
eters are obtained by maximizing the mean r(h, h') 
over pairs of sequences sharing the TIM barrel fold 
(see 'Sequence comparison'). Column 5 shows the hy- 
drophobicity scale determined by Fauchere and Pliska 
[24]. 



By comparing Eq. (5) to the definition of the 
PE c of the contact matrix C, Eq. (1), one 
sees that the HP satisfying h{Ai) — const x c^ 
minimizes the energy H{A, C) for a fixed value 

of the sum X]i=i [^(^i)] • We define the opti- 
mal HP of the contact matrix C as the vector 
h minimizing if (A, C) with the constraints of 
fixed ^^ [ft,(^i)] and fixed average hydrophobic- 
ity (h) == N-^J2iH^t)- The latter condition is 
imposed because, if the average hydrophobicity is 
large, the hydrophobic energy will be low not only 
for the native contact map, but also for alternative, 
misfolded structures. Thus (h) has to be reduced 
in order to obtain a sequence with a large normal- 
ized energy gap and a well correlated free energy 
landscape, which is a requisite for fast folding and 
thermodynamic stability, as well as evolutionary 
stability [25-29,21]. 

The exact solution of this optimization prob- 
lem as a function of (h') is involved. Here we 
report only its qualitative features: The normal- 
ized scalar product of the optimal HP and the 
PE equals one if the average hydrophobicity satis- 
fies (h) = (/i)p = {c)\/^~h^, which is the value 
where the energy i?(A, C) is minimal, and then 

decreases proportionally to ((/i) — (h) ) . 

On the basis of the interactivity scale, protein 
sequences in the PDB have an average hydropho- 
bicity (ft.) ~ 0.77(/i) , a value at which the op- 
timal HP is still almost parallel to the PE: The 
scalar product between PE and optimal HP was 
larger than 0.85 in all cases in which we calculated 
it. It is interesting that (h') in globular proteins 
in the PDB is always smaller than (fi) . A larger 
value of (/i) would yield a lower hydrophobic free 
energy H{A,C), but it would decrease the energy 
gap, with the consequence of making folding less 
efficient. 

Since all its components are positive, the PE has 
a large average value (c), which contributes signif- 
icantly to the scalar product. Therefore, it is more 
convenient to measure the similarity between PE 
and HP through their correlation coefficient r(c, h) 
(see Methods). The correlation coefficient between 
the optimal HP and the PE is expected to be close 
to one. This relationship between the PE and the 
optimal HP is very useful to investigate the rela- 
tionship between sequences and structures, and it 
provides a new perspective on protein evolution. 



Evolutionary average of hydrophobicity 
profiles 

We do not expect sequences resulting from an 
evolutionary process to have optimal HP. In a pop- 
ulation with M individuals, natural selection is 
only able to fix advantageous mutations whose se- 
lective advantage is at least of order 1/M. Vari- 
ants with lower (positive) selective advantage are 
effectively neutral and evolve by random genetic 
drift [30]. When the thermodynamic stability of 
the protein is large, mutations improving stabil- 
ity are less likely to occur. Therefore one expects 
that protein stability does not overcome a typical 
value that depends on the population size, the se- 
lection strength, and the mutation rate. Moreover, 
selection for proper function places constraints on 
key amino acids and may prevent proteins to reach 
thermodynamically optimal sequences. 

Nevertheless, we do expect a positive correla- 
tion between HP and PE, since this gives an im- 
portant contribution to the stability of the native 
structure. Therefore, we predict that protein evo- 
lution visits a region in the hydrophobicity space 
centered about the optimal HP. To test this hy- 
pothesis, we average the HP over sets of aligned 
sequences derived from the PFAM, the FSSP, and 
the SON databases (see Methods). 

Table II presents the correlation coefficients 
r(c, h) between the PE of the native structure and 
five HP, for seven protein folds of various lengths. 
First, we calculate r(c, h) using the HP of the na- 
tive sequence in the PDB. The mean value is 0.47 
(fold average). The probability that this correla- 
tion arises by chance is comprised between less 
than 10^'' in the case of rubrcdoxin {N = 53) 
and less than 10^^^ in the case of the TIM bar- 
rel {N = 247). Second, we average r(c,h) over 
all homologous sequences in the same PFAM class, 
obtaining a fold average of 0.45, very similar to the 
previous one. The same procedure similarly gives a 
mean correlation coefficient of 0.45 with sequences 
in the same FSSP class, and 0.45 with sequences 
in the same SCN class (not shown). We then av- 
erage the HP over sequences in the same family 
and use it to calculate r(c, h), finding significantly 
larger values. The mean r(c, h) is 0.57 when the 
HP is averaged over a PFAM class, 0.58 when it is 
averaged over a FSSP class, and 0.96 when it is av- 
eraged over the set of sequences obtained through 
the SCN model. In Fig. 1 we show a scatter plot 
of the PE versus the average HP over the SCN set 
and the FSSP set for the myoglobin fold. 
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FIG. 1. Scatter plot of PE components, d, and HP 
components, h{Ai), the latter ones averaged over two 
sets of sequences with the myoglobin fold: The SCN 
set obtained through simulated evolution (circles) and 
the FSSP set obtained through structure comparison 
(diamonds). Correlation coefficients are 0.96 and 0.46, 
respectively. 

This analysis establishes an important result: 
The evolutionary-averaged HP correlates with the 
PE better than the HP of each individual sequence. 
This suggests that, although individual sequences 
are not optimal, the evolutionary process moves 
around the optimal sequence in sequence space. 
This interpretation is strongly supported by the 
results of the SCN model, for which the average 
HP is very close to the optimal HP, although in- 
dividual sequences have r(c, h) not much different 
from those of the PFAM and FSSP databases. 

In order to get a deeper insight into the mech- 
anism by which the SCN model operates, we have 
simulated the evolution of the seven protein folds of 
Table II by imposing the selective constraint either 
on (a) the normalized energy gap, or (b) the folding 
free energy. The SCN model uses a combination of 
both constraints. When selection is imposed only 
on the energy gap, case (a), the folding free energy 
is higher and the mean correlation coefficient de- 
creases by around 15%. This agrees with our argu- 
ment that the strong correlation between average 
HP and PE arises from minimization of the native 
energy. In contrast, when only the folding free en- 
ergy is tested to accept mutations, case (b), the 
mean correlation coefficient increases only slightly, 
but the energy does not change significantly with 
respect to the SCN model. This is in line with our 
expectation that relaxing the constraint on the en- 
ergy gap increases the correlation between the PE 
and the optimal HP. 

Last, we test the influence of the size of sequence 



database on the value of the correlation coefficient. 
Whereas FSSP and PFAM families consist of a few 
tens or hundreds of very correlated sequences, the 
number of sequences in an SCN family is of tens 
of thousands. The correlation between the PE and 
the average HP decreases if a smaller number of 
sequences is used to calculate the average, and it 
becomes similar to the value obtained for the FSSP 
and PFAM databases when we use only hundreds 
of sequences. 



Parameters optimization 

In the SCN model we assume that Eq. (4) gives 
the exact free energy of the system. From this as- 
sumption, we have obtained here the interactivity 
parameters by diagonalizing the contact interac- 
tion matrix U [20]. However, Eq. (4) is only ap- 
proximate for real proteins. It is therefore possible 
that the correlation between average HP and PE 
can be improved by using another set of 20 param- 
eters. 

To test this possibility, we first calculated the 
correlation coefficient of the PE with the HP ob- 
tained using the hydrophobicity parameters mea- 
sured by Fauchere and Pliska [24] . The correlation 
coefficients between HP and PE remain in this case 
very similar to those calculated above. 

We then optimized the set of 20 parameters by 
maximizing the mean correlation coefficient r(c, h) 
over a subset of the PDB containing single-domain 
globular structures (see Methods). The optimal 
parameters, reported in Table I, were very sim- 
ilar to the parameters obtained from the princi- 
pal eigenvector of the matrix U (correlation co- 
efficient 0.95) and the mean r(c, h) so obtained 
was less than 4% better than the value previously 
obtained. Therefore, the interactivity parameters 
that we are using in this study are almost optimal 
from the point of view of maximizing r(c, h) for 
PDB proteins. 



Sequence comparison 

We have seen that the HP of a protein sequence 
is correlated with the PE of its native structure. 
This implies that the HP of sequences folding into a 
similar structure correlate with each other. There- 
fore, one can measure the similarity of two protein 
sequences through their HP correlation (HC) score, 
defined as 



HC(A,A')=r(h(A),h(A')). 



(6) 



where the correlation r is defined in Eq. (3) and 
h(A) indicates the HP of sequence A. This se- 
quence similarity measure was introduced by Sweet 
and Eisenberg [2]. 

We compared this similarity score with the score 
obtained through the BloSum62 score matrix [17], 
a substitution matrix commonly used in bioinfor- 
matics applications. We applied the two scores 
to the family of sequences sharing the TIM bar- 
rel fold, aligned through the structural alignment 
algorithm Dali [13]. This family possesses many 
sequence pairs whose relationship is very difficult 
to detect on the basis of the sequence alone, even 
with the best current algorithms [18]. 
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FIG. 2. For eacli pair of sequences, we compare tlie 
HC score and the BloSuni62 score. Circles represent 
pairs of sequences with TIM barrel fold; triangles rep- 
resent random sequences with the same length and the 
same number of gaps as the true sequences. Positions 
with gaps are always less than 20% of the sequence, 
and gaps are omitted in the calculation of the scores. 
The line shows a quadratic fit. 

We consider 136 pairs of sequences with the TIM 
barrel fold, as listed in the FSSP database, that 
can be aligned for more than 80% of their length, 
and that have less than 90% sequence identity. For 
each pair, we compare in Fig. 2 the similarity score 
measured through the BloSum62 score and the HC 
score described above. As seen from the figure, the 
correlation between the two scores is very strong, 
despite the fact that the HC score uses only 20 
parameters, whereas the BloSum matrix consists 
of 210 parameters. None of the two scores is able 
to recognize all TIM barrel pairs with respect to 
pairs of random sequences. These results confirm 



the reported correlation between the HP of even 
distantly related proteins with the same fold [2], 
therefore supporting the idea that the optimal HP 
represents the fingerprint of the protein fold. 

Also in this case, one may ask whether the 
correlation can be improved using optimized pa- 
rameters. To address this question, we maxi- 
mized the mean r(h, h') over the set of pairs of 
sequences with TIM barrel fold used above (see 
Methods). The optimization yielded a marginal 
improvement, from 0.497 to 0.511, and the opti- 
mal parameters were almost identical to the for- 
mer ones (correlation coefficient 0.97, see Table I), 
whereas they departed further from the parame- 
ters obtained by maximizing the mean r(c, h) over 
single-domain PDB structures (correlation coeffi- 
cient 0.93). Therefore, the interactivity parame- 
ters given by the principal eigenvector of our in- 
teraction matrix are close to optimal with respect 
to both sequence to structure and sequence to se- 
quence comparison. 



Vectorial Protein Space 

The results presented here show that it is use- 
ful to represent both protein sequences and struc- 
tures as vectors in the same A'^-dimensional space. 
This Vectorial Protein Space is endowed with 
a natural metric through the correlation coeffi- 
cient, Eq. (3). This representation provides a uni- 
fied framework for addressing three issues that are 
central in bioinformatics: sequence to sequence 
alignments [31], structure to structure alignments 
[13,32], and sequence to structure alignments for 
the purpose of protein structure prediction [33] . 

Concerning sequence to sequence alignments, 
we saw that the HC score r(h, h'), already pro- 
posed by Sweet and Eisenberg [2] , performs equally 
well as the BloSum62 score matrix [17] despite 
having 20 parameters instead of 210. This result 
is probably due to the fact that the HC score is 
context dependent: The HC score for substituting 
residue a in sequence A with residue 6 in sequence 
A' does not depend on a and b alone, but also on 
the average and the variance of the HP in the two 
sequences. 

The r(c, c') score for structure to structure 
alignment is strongly correlated with the widely 
used contact overlap. When low energy structures 
generated by threading are compared to the native 
one, the two similarity measures have mean corre- 
lation coefficient 0.93 for the seven protein folds 
studied, i.e. they are almost equivalent. We show 



in Fig. 3 the folds where the two distance measures 
have largest (rubredoxin, 0.98) and smallest (myo- 
globin, 0.87) correlation coefficient. Moreover, this 
score is strongly correlated with the effective en- 
ergy function, which is an important requisite of 
suitable measures of structural similarity [34] . 
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FIG. 3. Scatter plot of the contact overlap g(C, C) 
(upper panel) and of the sequence to structure score 
r(c,h) (lower panel) vs structural similarity measured 
by r(c,c'). Here h indicates the HP of the protein 
sequence, c' indicates the PE of the native state and 
c indicates the PE of alternative states generated by 
threading with gap penalties. In both cases, the two 
folds presenting largest and smallest correlation are 
shown. 

Finally, the HP-PE correlation score r(c, h) for 
sequence to structure alignment is strongly cor- 
related with the effective free energy of the struc- 
ture from which the PE is obtained and with the 
similarity between this structure and the native 



one (see Fig. 3). Different from our effective free 
energy function Eq. (4), this score does not as- 
sign the highest value to the native structure with 
respect to decoys generated by threading. Never- 
theless, the highest scoring decoys are very simi- 
lar to the native ones, particularly in the case of 
large proteins. Moreover, all structures similar to 
the native tend to have high r(c, h) score. There- 
fore, this score may be useful for rapidly screening 
from a large database a restricted number of candi- 
date structures for more accurate fold recognition 
techniques, as expected based on the success of se- 
quence to structure comparison methods based on 
profiles [3,33]. 



DISCUSSION 

The relationship between hydrophobicity and 
pair potentials has been studied by various groups, 
including Li et al. [22] , Betancourt and Thirumalai 
[35] and more recently Cline et al. [36]. Here we 
show that the principal eigenvector of the contact 
matrix of the native structure (PE), a global in- 
dicator of protein structure, is positively corre- 
lated with the hydrophobicity profile (HP) of its 
sequence and, more strongly, with the average HP 
of sequences adopting the same fold. 

The hydrophobicity scale used in this work was 
derived from the principal eigenvector of the effec- 
tive pair interaction matrix by BastoUa et al. [20]. 
We have called it interactivity scale to underline 
the fact that it embodies also other kinds of inter- 
actions besides the hydrophobic effect. Its strong 
correlation with empirical hydrophobicity scales 
(the correlation \s R — 0.85 with the octanole scale 
by Fauchere and Pliska) and the demonstration 
that pair potentials are dominated by hydropho- 
bicity justify our simultaneous use of the name hy- 
drophobicity profile. One should note, however, 
that the word hydrophobicity is not used here with 
its strict biochemical meaning. 

After this paper was submitted, we became 
aware of a recent and interesting paper where 
'buriability' parameters were derived for each 
amino acid from the thermodynamic effects of 
site-directed mutagenesis [37]. Interestingly, these 
buriability parameters are almost identical to our 
interactivity scales, with a correlation coefficient 
of 0.92 with the scale derived from the principal 
eigenvector of the interaction matrix and a cor- 
relation coefficient of 0.98 with the scale derived 
optimizing the HP-PE correlation. 



The HP can be useful for recognizing and align- 
ing distantly related sequences, as proposed by 
Sweet and Eisenberg [2] , and for aligning sequences 
and structures of related proteins. It is interesting, 
and perhaps surprising, that the hydrophobicity 
parameters obtained from the principal eigenvec- 
tor of our interaction matrix [20] are almost opti- 
mal for both purposes, since they almost maximize 
at the same time the correlation between distant 
sequences sharing the same fold and the correla- 
tion between the PE of single-domain globular pro- 
teins and the HP of their sequences. Therefore, the 
results presented in this paper can help develop- 
ing new bioinformatics methods and algorithms to 
unify and perhaps improve different kinds of align- 
ments. 

From Eq. (5), one may expect a correlation be- 
tween the PE and the HP much stronger than the 
one observed, which is in the range 0.4 to 0.6. In 
fact, the contact matrix with lowest effective free 
energy, Eq. (4), is characterized by r(c, h) close to 
one, since the contact matrix with r(c, h) = 1 min- 
imizes the effective hydrophobic energy, Eq. (5), 
for a fixed value of the principal eigenvalue Ai , ex- 
pressing the number of contacts per residue, and 
in turn Eq. (5), gives the most important contri- 
bution to the effective contact free energy. The 
ground state of our protein model, which we iden- 
tify with the native state, is the protein-like struc- 
ture having the lowest effective energy. By protein- 
like we mean that local structure, dihedral angles, 
excluded volume interactions, local electrostatic 
interactions, hydrogen bonds, etc., are distributed 
as in native protein structures. These conditions 
are not enforced through the effective energy func- 
tion, but they are obtained constraining candidate 
structures to fragments of protein crystal struc- 
tures. By threading protein sequences against the 
whole PDB database with a suitable gap penalty, 
we never found any PE of protein contact matri- 
ces whose correlation with the HP was close to one. 
Such a contact matrix would have lower effective 
free energy than the true native state, whose cor- 
relation with the HP is of the order of 0.5. 

In the case of small proteins, as rubredoxin 
(N = 53, see lower panel of Fig. 3), one can find 
alternative structures very different from the na- 
tive one with r(c, h) larger than for the native 
structure, but still much smaller than one. But 
for proteins with more than 100 residues, such as 
for instance the TIM barrel {N = 247, Fig. 3) 
all the structures with large correlation r(c, h) are 
very similar to the native one. This results suggest 
that, for HP of natural proteins, protein-like struc- 



tures having r(c, h) close to one do not exist, and 
open the question of why large regions of the A'^- 
dimensional Vectorial Protein Space do not seem 
to contain vectors that are the PE of the contact 
matrix of some protein-like structure. 

The only moderate correlation between HP and 
PE has deep implications on protein evolution. 
The requirement of a strong correlation would im- 
ply that only sequences very similar to a protein 
fold (in the sense of the correlation coefficient) 
are compatible with this fold. This would con- 
trast with the observation that the distribution of 
sequence identity for proteins adopting the same 
fold approaches the distribution for random pairs 
of sequences [38] . Further analysis will be needed 
to understand the relationship between these two 
observations. 

To gain further insight into the sequence- 
structure relationship, we have defined the optimal 
HP of a given fold. Our simple model of protein 
folding, Eq. (4), leads to the prediction that the 
optimal HP is strongly correlated with the PE. 
Notice that this calculation also provides an an- 
alytical solution to sequence design based on the 
effective energy function Eq. (5). 

We expect that sequences sharing a given fold 
can not be too different from the optimal HP, al- 
though this HP is never actually attained during 
protein evolution. This interpretation is strongly 
supported by an analysis of sequences derived from 
our SCN model of evolution. In this case, the PE of 
the (fixed) reference structure is moderately corre- 
lated with the HP of individual sequences, but it is 
very strongly correlated with the average of the HP 
over all sequences. This suggests that our model 
of neutral evolution can be described as a motion 
in sequence space around the optimal HP. 

The same result is obtained considering sets 
of homologous sequences (PFAM) and sets of se- 
quences sharing the same fold (FSSP): The HP av- 
eraged over these sequences is more strongly corre- 
lated with the PE than the HP of each individual 
sequence. By means of this result, evolutionary in- 
formation may give a valuable contribution to the 
goal of predicting protein structure using the PE 
as an intermediate step. 

The fact that these correlations are much weaker 
for PFAM and FSSP families than for the SCN 
model is not unexpected. First, functional con- 
straints are important in protein evolution, but 
they are not represented in the SCN model. These 
may cause conservation of amino acids that are not 
optimal from a thermodynamic point of view. Sec- 
ond, our structural model only considers interac- 
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tions between residues in the same chain, whereas, 
for several of the proteins that we studied, interac- 
tions with cofactors and with other protein chains 
play an important role. The presence of disul- 
phide bridges and iron-sulfur clusters is also prob- 
lematic. Third, the FSSP and PFAM databases 
contain much less sequences than those obtained 
through simulated evolution. Decreasing the size 
of the SCN set, the correlation with the PE also 
decreases. 

Part of the difference between observed and sim- 
ulated protein evolution may also be due to the 
fact that the free energy function in Eq. (4) is not 
sufficiently accurate to describe real proteins. Us- 
ing this equation, we are neglecting other kinds of 
interactions relevant for protein stability, such as 
hydrogen-bonding, packing interactions, and cn- 
tropic contributions to the native free energy. It 
is possible that these neglected interactions are 
only partially averaged out when summing over 
a large set of homologous proteins, even if they 
contribute substantially to protein folding thermo- 
dynamics. This may result in an increased, yet not 
perfect correlation between the average HP and the 
PE. Alternatively, the hydrophobicity itself might 
be context-dependent, for instance influenced by 
neighboring residues. This could lead to what is 
indeed observed, i.e. a significant but not com- 
plete correlation between HP and PE. However, we 
found that the parameters that maximize the cor- 
relation between HP and PE are very strongly cor- 
related with the interactivity parameters obtained 
from the principal eigenvector of the interaction 
matrix. This result supports the view that the 
interaction matrix that we are using describes an 
important contribution to protein stability. 

Another interesting property of the interactiv- 
ity scales that we have derived here, besides their 
expected strong correlation with protein folding 
thermodynamics, is that the interactivity of or- 
thologous protein also correlates very strongly with 
properties of the genomes in which these proteins 
are expressed (U. Bastolla et a/., submitted). 

The relationship between HP and PE suggests 
that proteins of low sequence similarity may share 
the same fold provided their HP are correlated 
with the optimal HP. Therefore the optimal HP 
constitutes a common hydrophobic fingerprint that 
characterizes a protein fold. 
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Protein 


PDB id. 


N 


r(c,hpDB) 










r(c, hpFAM) 


r(c, hpFAM) 


r(c,hFssp) 


r(c,hscN) 


rubredoxin 


liro/lbrfA 


53 


0.496 


0.465 


0.602 


0.599 


0.987 


cytochrome c 


451c 


82 


0.500 


0.491 


0.573 


0.684 


0.962 


ribonuclease 


7rsa 


124 


0.431 


0.400 


0.491 


0.498 


0.965 


lysozyme 


31zt 


129 


0.531 


0.544 


0.649 


0.627 


0.949 


myoglobin 


la6g 


151 


0.399 


0.352 


0.472 


0.465 


0.966 


ubiquitin conjugating enz. 


lu9aA 


160 


0.451 


0.450 


0.593 


0.567 


0.943 


TIM barrel 


7timA 


247 


0.466 


0.404 


0.486 


0.584 


0.970 



TABLE II. Correlation coefficients between the PE and the HP, r(c,h) for seven protein folds of different 
length A'". In column 4, the HP is obtained from the PDB sequence from which the PE is calculated. In column 5, 
the r is averaged over different PFAM sequences. Single-sequence r's are calculated using the alignments between 
the PDB sequence and sequences from the same PFAM family. Mean values do not differ significantly from those 
for the original PDB sequence. Similar values are obtained averaging r over sequences in the FSSP and SCN 
families (not shown). In the remaining columns, the mean HP is obtained by averaging the HP of sequences 
in the same family of the PFAM database (column 6), the FSSP database (column 7), and the SCN database 
(column 8) , respectively. Most values of the PFAM and FSSP databases are very similar to each other and both 
higher than the value of r for individual sequences, but significantly smaller than the corresponding values for 
the SCN database. Note that for the correlation coefficient based on the average HP, sites containing cysteine 
residues are not included, as they form pairwise disulphide bridges (which are very poorly represented through 
the hydrophobic energy) and are strictly conserved in our evolutionary model. 
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